Extracting topics from reviews
Dataset previews
File name                            File size
-----------------------------------  -------------------
yelp_academic_dataset_business.json  0.11Gb
yelp_academic_dataset_checkin.json   0.27Gb
yelp_academic_dataset_review.json    4.98Gb
yelp_academic_dataset_tip.json       0.17Gb
yelp_academic_dataset_user.json      3.13Gb
Preview of dataset: "business"
| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46 | JX4tUpd09YFchLBuI43lGw | Naked Cyber Cafe & Espresso Bar | 10303 108 Street NW | Edmonton | AB | T5J 1L7 | 53.544682 | -113.506589 | 4.0 | 12 | 1 | {'OutdoorSeating': 'False', 'BusinessParking':... | Arts & Entertainment, Music Venues, Internet S... | {'Monday': '11:0-1:0', 'Tuesday': '11:0-1:0', ... |
| 214 | LVYAXWQB3t7tdwWteyjfhw | Option 1 Barber Shop | 5537 Sheldon Rd, Ste E | Tampa | FL | 33615 | 27.998700 | -82.582253 | 4.0 | 16 | 0 | {'ByAppointmentOnly': 'False', 'BusinessParkin... | Barbers, Beauty & Spas | {'Monday': '9:0-19:0', 'Tuesday': '9:0-19:0', ... |
| 736 | S2LinHvVEXAm2jv84_kXLw | St. Louis Artists' Guild | 2 Oak Knoll Park | Saint Louis | MO | 63105 | 38.637678 | -90.319549 | 4.0 | 9 | 0 | {'RestaurantsPriceRange2': '2', 'BusinessAccep... | Festivals, Art Galleries, Local Flavor, Commun... | {'Tuesday': '12:0-16:0', 'Wednesday': '12:0-16... |
| 992 | 7zfO3VB6wEqDnk0_U16uBg | Maximum Grow Gardening | 6117 E Washington St | Indianapolis | IN | 46219 | 39.771263 | -86.061167 | 4.0 | 5 | 1 | {'BikeParking': 'True', 'BusinessAcceptsCredit... | Home Services, Home & Garden, Shopping, Garden... | {'Monday': '10:30-19:0', 'Tuesday': '10:30-19:... |
| 974 | zKBIdA2j49REmU2bFR1mdw | BayLife Physical Therapy and Rehabilitation- S... | 8950 Doctor Martin Luther King Junior St N, St... | St. Petersburg | FL | 33702 | 27.854197 | -82.647427 | 5.0 | 9 | 1 | {'ByAppointmentOnly': 'True', 'BusinessAccepts... | Rehabilitation Center, Health & Medical, Physi... | {'Monday': '8:0-19:0', 'Tuesday': '8:0-19:0', ... |
business_id SFKjUQ1gmfwm7cJhMCFmkA
name ZigZag Scallop
address 4417 Calienta St
city Hernando Beach
state FL
postal_code 34607
latitude 28.4977
longitude -82.650015
stars 4.0
review_count 214
is_open 1
attributes {'Alcohol': ''full_bar'', 'GoodForMeal': '{'de...
categories Nightlife, Seafood, Bars, Restaurants, America...
hours {'Monday': '11:30-21:0', 'Thursday': '11:30-21...
Name: 595, dtype: object
Preview of dataset: "checkin"
| | business_id | date |
|---|---|---|
| 398 | -Awb67JgBbySP4mQtOtNsA | 2011-09-24 15:48:25, 2012-04-19 17:24:46, 2012... |
| 292 | -7aDp7JsogemWTKOuZdNTw | 2011-01-28 18:01:12, 2011-03-18 04:17:53, 2011... |
| 557 | -FeeEqYmAJ00G66ngdxFZg | 2014-08-24 15:52:47, 2014-09-14 17:59:58 |
| 802 | -OKB11ypR4C8wWlonBFIGw | 2010-03-21 01:26:00, 2010-03-27 05:43:30, 2010... |
| 261 | -6qt8a52bBwMogqwZsooOA | 2021-06-11 17:36:41, 2021-07-03 16:23:55, 2021... |
business_id -1K1J_D9eT2dR6BNvQ2Tnw
date 2010-10-19 23:45:27, 2011-12-26 20:17:49, 2012...
Name: 82, dtype: object
Preview of dataset: "review"
| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 787 | zBrO_zs81k9U-ZpyR_p8fw | FsbQY_iNJPm4xAZ9vERCBw | Jx2AoB_IQOUrZ3s6fdAUSA | 5 | 0 | 0 | 0 | Great Cranberry Orange Muffin and Caffe Latte ... | 2017-05-13 13:05:01 |
| 263 | wEfzqOfbwn4Ohe2ZDOLAzw | VMtyZjaEJB9nfmjr4xdVlw | GBTPC53ZrG1ZBY3DT8Mbcw | 4 | 1 | 1 | 0 | First meal in New Orleans. I had the $15 lunch... | 2012-11-06 22:28:18 |
| 415 | ijG3hyvzneIplamARXzfEg | TDGx8YhxmF3OP0XH3dYsXw | SoJwDKedR7SJh7-G69C38A | 5 | 0 | 0 | 0 | no contest best Chinese good in the area. owne... | 2016-11-09 23:41:55 |
| 176 | x-1wBBwja9l2Hr5bgqsG0A | L-2Qdi16eMRbATGDP6ADHg | mQvRi0nm84Www71d4qOheQ | 5 | 1 | 0 | 1 | Excellent food. We tried three appetizers and ... | 2016-07-03 21:13:10 |
| 674 | YFp9hHkElfJGvvdo5T9MuA | 98jv8gu7kAwa2WzIPdw6-w | _RwlMTw9uFeOkfX9Ctf1HA | 3 | 0 | 0 | 0 | While I generally support independent restaura... | 2015-01-04 00:57:18 |
review_id vqmhsvXK9z4TTvnVDNpPDQ
user_id ziNigH8BY9gRDvrmSsJTOw
business_id ICqgjbOpBD9SUtE5PQC9sA
stars 5
useful 0
funny 0
cool 0
text Fun, no-frills atmosphere, right on the water,...
date 2018-02-23 23:06:03
Name: 235, dtype: object
Preview of dataset: "tip"
| | user_id | business_id | text | date | compliment_count |
|---|---|---|---|---|---|
| 300 | Um5bfs5DH6eizgjH3xZsvg | Dv1SSVUWj1qmvAaSuRiCdg | Best brunch in the Clearwater area. Come and c... | 2014-03-14 16:57:19 | 0 |
| 945 | X1nvKXUJ5Lp3W9Oe-_JrMQ | fo4TOAiwYEZ5p13kqg1Ukw | Use the phone app and preload for ease of paym... | 2013-12-29 18:16:29 | 0 |
| 666 | cLCMvwsFgKx2f6rrK3boOQ | 434A83c2ig6QxsZjrjclpQ | Spinach dip is amazing! Come at least 30 minut... | 2013-08-23 14:10:54 | 0 |
| 615 | 0xZucjnNt2beD1veIAWLwA | h7Fq7pBe2uMD5doA91j6XQ | Yazoo Pale Ale on draft & live music! | 2011-09-30 22:09:32 | 0 |
| 258 | njcXFGqIuSp-_joP42MhxA | ltBBYdNzkeKdCNPDAsxwAA | The kids are eating...it's a miracle... | 2011-08-20 19:29:22 | 0 |
user_id weB8wGdi1A1SXh8CMCXDTw
business_id VZWuhqiPJCZmHfJmwdiCGA
text Thursday - 1/2 price wine and fish & chip special
date 2018-03-30 02:36:09
compliment_count 0
Name: 689, dtype: object
Preview of dataset: "user"
| | user_id | name | review_count | yelping_since | useful | funny | cool | elite | friends | fans | ... | compliment_more | compliment_profile | compliment_cute | compliment_list | compliment_note | compliment_plain | compliment_cool | compliment_funny | compliment_writer | compliment_photos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 73 | KxrKVxdXGkfMJ9XwJZzoLQ | Lisa | 950 | 2008-10-16 21:03:02 | 1243 | 368 | 527 | 2010,2011,2012,2013,2014,2015,2016,2017,2018,2... | EWNe5k2pLEefqWnAdC4_1A, C44UVPGmzusO4_a576Wa6g... | 35 | ... | 2 | 1 | 1 | 1 | 26 | 51 | 34 | 34 | 11 | 5 |
| 436 | fHS0bQ-l5rHME_xXKQSYXQ | Kevin | 1401 | 2007-03-19 18:19:11 | 7875 | 3954 | 6616 | 2007,2008,2009,2010,2011,2012 | 4Zi2HXp_uEjAgJHTvIsCXg, BWsutShwFQiQMoITF9IMOg... | 383 | ... | 49 | 31 | 46 | 75 | 515 | 1589 | 947 | 947 | 264 | 60 |
| 633 | xQXUG94oRQxYUHZwS6Cwzg | Michelle | 943 | 2009-05-31 05:08:55 | 4092 | 2315 | 2561 | 2009,2010,2011,2012,2013,2014,2015,2016,2017 | NdHGV2JmZmhYG1tSCRwrBg, wr1Y8-yLVCoaDKR4ihiTGg... | 69 | ... | 64 | 13 | 15 | 6 | 357 | 537 | 561 | 561 | 180 | 34 |
| 700 | 0iq45WV_h_j8SXjR4ytcJg | A | 186 | 2009-12-08 23:21:45 | 669 | 301 | 379 | | dXpgRqVIJZZcD87Tf9DtUQ, 5pDJuri4g3Wfes8GGlnudA... | 46 | ... | 2 | 1 | 0 | 0 | 7 | 22 | 18 | 18 | 4 | 4 |
| 192 | NRrNQ5xHn_7Fu4ctlpKLbQ | Justin | 1002 | 2006-12-31 06:40:43 | 1256 | 639 | 549 | 2015,2016,2017,2018,2019,20,20,2021 | tVzBb8_2bkknwBZIbSq3hQ, vQ4IV9xP_t-lfj0eGE46hA... | 51 | ... | 6 | 2 | 2 | 1 | 26 | 50 | 48 | 48 | 16 | 3 |
5 rows × 22 columns
user_id z3yOJaNdvqzXc6L1RbTl_w
name rebecca
review_count 107
yelping_since 2006-12-04 19:03:52
useful 152
funny 42
cool 43
elite 2010
friends hZg-KQusgDFRrcGTRUuSWA, NFYDEgblCeBMlwdpL_UsRA...
fans 1
average_stars 3.06
compliment_hot 1
compliment_more 2
compliment_profile 0
compliment_cute 1
compliment_list 0
compliment_note 6
compliment_plain 3
compliment_cool 1
compliment_funny 1
compliment_writer 0
compliment_photos 0
Name: 644, dtype: object
23% of reviews have at most 2 stars
Overview of categories
- Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, Acupuncture, Health & Medical, Nutritionists
- Shipping Centers, Local Services, Notaries, Mailbox Centers, Printing Services
- Department Stores, Shopping, Fashion, Home & Garden, Electronics, Furniture Stores
- Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries
- Brewpubs, Breweries, Food
- Burgers, Fast Food, Sandwiches, Food, Ice Cream & Frozen Yogurt, Restaurants
- Sporting Goods, Fashion, Shoe Stores, Shopping, Sports Wear, Accessories
- Synagogues, Religious Organizations
- Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek
- Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food
- Department Stores, Shopping, Fashion
- Vietnamese, Food, Restaurants, Food Trucks
- American (Traditional), Restaurants, Diners, Breakfast & Brunch
- General Dentistry, Dentists, Health & Medical, Cosmetic Dentists
- Food, Delis, Italian, Bakeries, Restaurants
- Sushi Bars, Restaurants, Japanese
- Automotive, Auto Parts & Supplies, Auto Customization
- Vape Shops, Tobacco Shops, Personal Shopping, Vitamins & Supplements, Shopping
- Automotive, Car Rental, Hotels & Travel, Truck Rental
- Korean, Restaurants
- etc...
A sample of reviews by rating
Rating = 1
They have the WORST service advisors! Used to be good before Kelly and her team left. Unfortunately, it's convenient to work if I need oil change before I can make it to another Honda dealer.
It is unfortunate that with such a unique location and such a brand and product offering this specific store offers such lousy service. The wait is endless, no one is available to help and at Christmas time getting a gift wrap is act of God that requires endless wait. I bought gifts and knew that the wait for wrapping would be long SO I even left my items at the store to be gift wrapped at their leisure. They were not even moved from the counter where I bought them when I returned almost two hours later ready for pick up. This was a gift that needed to be given and The staff COMPLETELY "dropped the ball" on my time constraints!
I love their stuff, but today was my last shopping experience at this location: couldn't get a gift wrapped after being assured that it could be done in a timely fashion??? I'll cancel my card, do everything online and try not to go there if I can. It's really a shame!
Rating = 2
We arrived for lunch at 12p on a Friday - wasn't busy at all. It took FOREVER to get anything. We were really pleasant and kind but we had to complain several times to get any beverages that we ordered. It took them 40 minutes to let us know that a beer we ordered was no longer available. The food was great, but not worth the lengthy wait and slow service.
This use to be a reliable place for sandwiches but the last two were not good at all. Cheesesteak was light on the meat. There are so many good rolls to pick from in the Philly area but this roll was one of the worst and stale.
Hopefully it was just an off night. I'll try again but not for awhile.
Rating = 3
Honestly the food doesn't knock my socks off but other people seem to love this place. I go because my husband likes it as for me I'd rather go to a different BBQ spot. I guess it also depends on what you order.
If not for the pretentious, haughty, superior attitude of our waiter, I would have given this place four stars or possibly more... Seriously. That kind of attitude is exactly why I left New York. We wanted to order a bottle of wine, asked for his suggestion, and he answers with, "How much do you want to spend?" Ummm... Excuse me? How about, "I'm happy to make some recommendations. Let's find a price point you're comfortable with..." He didn't smile ONCE during the meal service and also found it necessary to correct us on several points of preference. Snob. The food is GOOD. The only thing that was great was the blueberry lasagna. And it is superb. So was the chocolate confection for dessert. I'd say go. And I hope you don't get that waiter.
Rating = 4
I love the concept of this place. One half of it a café that serves different types of coffees and teas, and breakfast type items, and the other world where it serves bar type apps, salads, and sandwiches, Can't forget that they also serve a wide variety of beers that they serve on tap, bottled, or in can.
This place is pretty darn casual, and one can hang out here for hours with their friends. Reminds me of the good ol' college days when no time could pass especially on a chill out Sunday Fun day.
Bring your four legged friend too. They are totally welcome.
We had an alumni event here and I really enjoyed it. It's dimly lit, very cozy inside - it's decorated kind of like an old library. It was a great quiet,casual spot if you're looking for a low key place to have good drinks.
They have a few beers and wines, but we focused mostly on the happy hour cocktails - moscow mules, old fashioneds, etc. Everyone loved theirs, no complaints!
Rating = 5
We sat at a pretty hectic lunch at Johnny rockets in the casino. Our server was Lyndel! She was awesome! Helped us at every needy request lol... Good was good, too! I'm too full
Great store. Insane selection. Incredible customer service.
Wish they could come to Ft. McMurray.
:(
Extracting a sample of reviews
52268 businesses are restaurants
Extraction characteristics
- Read in chunks of 100000
- Filter reviews on the "restaurants" category
- Split by rating
- Add business information
- Cap at 1000 reviews per rating
- Store in parquet files
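The extraction steps listed above can be sketched as follows. This is a minimal standard-library sketch, not the notebook's own code: the function names `iter_chunks` and `sample_reviews` are hypothetical, the field names follow the dataset previews, and the final parquet step is left out (with pandas it would typically be a `DataFrame.to_parquet` call).

```python
import io
import json
from collections import defaultdict
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size parsed JSON records."""
    iterator = iter(lines)
    while True:
        chunk = [json.loads(line) for line in islice(iterator, chunk_size)]
        if not chunk:
            return
        yield chunk


def sample_reviews(review_lines, businesses, chunk_size=100_000, per_rating=1000):
    """Collect up to per_rating restaurant reviews for each star rating."""
    # Keep only businesses whose category string mentions "restaurants".
    restaurants = {
        b["business_id"]: b
        for b in businesses
        if "restaurants" in (b.get("categories") or "").lower()
    }
    samples = defaultdict(list)
    for chunk in iter_chunks(review_lines, chunk_size):
        for review in chunk:
            business = restaurants.get(review["business_id"])
            if business is None:
                continue
            stars = int(review["stars"])
            if len(samples[stars]) >= per_rating:
                continue
            # Attach business information to the review.
            samples[stars].append({**review, "business_name": business["name"]})
        # Stop reading once every rating from 1 to 5 has reached the cap.
        if all(len(samples.get(s, ())) >= per_rating for s in range(1, 6)):
            break
    return dict(samples)


# Tiny in-memory stand-in for the business and review files.
businesses = [
    {"business_id": "b1", "name": "Pizza Place", "categories": "Pizza, Restaurants"},
    {"business_id": "b2", "name": "Barber", "categories": "Barbers, Beauty & Spas"},
]
reviews = io.StringIO(
    '{"business_id": "b1", "stars": 1, "text": "cold"}\n'
    '{"business_id": "b2", "stars": 1, "text": "bad cut"}\n'
    '{"business_id": "b1", "stars": 1, "text": "slow"}\n'
    '{"business_id": "b1", "stars": 5, "text": "great"}\n'
)
result = sample_reviews(reviews, businesses, chunk_size=2, per_rating=1)
print({stars: [r["text"] for r in rows] for stars, rows in result.items()})
```

Reading in chunks keeps memory use bounded on the 4.98Gb review file, and the per-rating cap lets the loop stop early once every rating has enough reviews.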
Word clouds by rating
Rating = 5
Rating = 4
Rating = 3
Rating = 2
Rating = 1
Finding the sources of dissatisfaction
Bag of words (TF-IDF)
Vectorization of reviews with at most 2 stars
Number of duplicate texts removed: 0
There are 2000 records
TF-IDF vectors of the reviews:
====================
| | 100 | 1st | 2nd | able | absolute | absolutely | accept | accommodate | acknowledge | across | ... | yelp | yes | yesterday | yet | york | young | yuck | yummy | zero | zero star |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.231769 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.176228 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
5 rows × 1901 columns
| Review text | TF-IDF vector |
|---|---|
| You don't accept cash? I don't think you grasp the ramifications of such a corpo-fascist economic principle. No room for arrogant commies in my diet thank you so very very little. Can't wait to see this place nosedive For anyone in the dark about this policy, watch Mike Judge's film, Idiocracy. Pay close attention throughout the hospital scene. "Unscannable!!!" | accept: 0.2965, anyone: 0.2395, attention: 0.2499, cash: 0.2758, close: 0.1923, dark: 0.2780, diet: 0.3070, little: 0.1816, pay: 0.1711, room: 0.2075, see: 0.1698, thank: 0.2486, throughout: 0.2851, very little: 0.2826, very very: 0.3111, watch: 0.2318 |
| Visiting from out of town I was excited to visit Chef Milly's newly opened restaurant. It was a sore DISAPPOINTMENT. Waited even w/a reservation and tried to be patient since they had just recently opened. Chef Milly does not even offer a smile to his visitors. The food was just OK and definitely not worth the wait. I will commend my server (I think his name was Ty) on his professionalism even amongst the multiple jobs he was tasked. Chef Ramsey would be highly disappointed - I surely was. | chef: 0.3954, definitely: 0.1777, definitely not: 0.2266, disappointment: 0.1942, excite: 0.1864, highly: 0.2429, job: 0.1992, multiple: 0.2079, name: 0.1809, not even: 0.1912, not worth: 0.1877, offer: 0.1547, open: 0.2541, recently: 0.2150, reservation: 0.1983, since: 0.1442, smile: 0.2250, town: 0.1905, visit: 0.2657, worth: 0.1579, worth wait: 0.2542 |
| Cashier was very rude fat Hispanic girl at the airport location. Very very rude. Half the menu wasn't even available. | airport: 0.3057, available: 0.2564, cashier: 0.2867, fat: 0.2803, girl: 0.2364, half: 0.1883, location: 0.1860, rude: 0.3302, very rude: 0.5024, very very: 0.3208, wasn even: 0.2996 |
| The worst customer service and nasty employees EVER!!! SLOW SERVICE!!!Missing items and cashier not giving correct change. I felt like she wanted my $10. Manager was no better...nasty and said to call and complain...she did NOT care. NEVER again!!!! | call: 0.1619, care: 0.1853, cashier: 0.2566, change: 0.1931, complain: 0.2101, correct: 0.2343, customer service: 0.1953, employee: 0.1886, ever: 0.1548, felt: 0.1931, felt like: 0.2426, item: 0.1942, manager: 0.1600, miss: 0.2079, nasty: 0.4218, not care: 0.2872, not give: 0.2767, slow: 0.1819, slow service: 0.2587 |
| Food was just about ok. Service very average and slow. Also, charging extra for something as simple as soy sauce is a bit strange when their other condiments were complimentary. Bar staff was good though. | average: 0.2527, bar: 0.2019, bite: 0.2135, charge: 0.2370, extra: 0.2534, food service: 0.2887, good though: 0.3770, sauce: 0.2010, service very: 0.3321, simple: 0.3032, slow: 0.2354, something: 0.2109, strange: 0.3437, though: 0.2078 |
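To make the weights above easier to read, here is a minimal pure-Python sketch of a TF-IDF computation. It assumes the smoothed idf and L2 normalisation that scikit-learn's `TfidfVectorizer` applies by default; the notebook's actual vectorizer settings are not shown, although the squared weights of the third review above sum to roughly 1, which is consistent with L2 normalisation. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term count multiplied by a smoothed inverse document frequency.
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, tf in counts.items()
        }
        # L2-normalise so that each document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    ["very", "rude", "cashier"],
    ["slow", "service", "rude"],
    ["food", "ok", "service", "slow"],
]
vectors = tfidf(docs)
for vector in vectors:
    print({term: round(weight, 4) for term, weight in sorted(vector.items())})
```

Terms that appear in few reviews (here "very" and "cashier") receive a higher weight than terms shared by several reviews (here "rude").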
LDA (scikit-learn library)
Topic search with the following parameters:
param value
----------- --------
max_stars 2
min_df 2
max_df 0.1
n_topics 3
alpha 0.5
n_top_words 5
ngram_range (1, 1)
- Vectorization (tf-idf)
- LDA modeling
- Topic display
Topic no. Categories
---------- ---------------------------------------
0 pizza, waitress, burger, sauce, bar
1 goopy, mahi, vista, isla, cashew
2 donut, fancy, environment, refer, bueno
Topic search with the following parameters:
param value
----------- --------
max_stars 2
min_df 2
max_df 0.2
n_topics 3
alpha 0.8
n_top_words 5
ngram_range (3, 3)
- Vectorization (tf-idf)
- LDA modeling
- Topic display
Topic no. Categories
---------- ----------------------------------------------------------------------------------------------
0 would not recommend, waste time money, wait another minute, not recommend place, give one star
1 not very good, want like place, food good service, give two star, nothing write home
2 take minute get, not worth wait, service very slow, food nothing special, not worth price
LDA (Gensim library)
Topic search with the following parameters:
param      value
---------  --------
max_stars  2
no_below   2
no_above   0.2
n_topics   3
n_grams    [2, 3]
- Data preparation (preprocessing, tokenization...)
- LDA for 3 topics
Topic no. Keywords
---------- ----------------------------------------
1 0.004*"taste like", 0.004*"come back", 0.003*"win back", 0.003*"food not", 0.002*"not good", 0.002*"mac cheese", 0.002*"wait staff", 0.001*"felt like", 0.001*"very good", 0.001*"place not"
2 0.002*"not worth", 0.002*"look like", 0.002*"next time", 0.002*"much good", 0.002*"very disappoint", 0.002*"not sure", 0.002*"first time", 0.002*"come back", 0.002*"mash potato", 0.002*"not return"
3 0.003*"get food", 0.003*"food good", 0.003*"take order", 0.002*"wait minute", 0.002*"would not", 0.002*"take minute", 0.002*"customer service", 0.002*"last night", 0.002*"place order", 0.002*"drink order"
- Data preparation (preprocessing, tokenization...)
- LDA for 3 topics
Topic no. Keywords
---------- ----------------------------------------
1 0.009*"never come back", 0.008*"waste time money", 0.005*"could give zero", 0.004*"would not recommend", 0.004*"get money back", 0.004*"take drink order", 0.004*"buy one get", 0.004*"not come back", 0.003*"take minute get", 0.003*"get order right"
2 0.005*"want like place", 0.005*"take minute get", 0.004*"speak manager tell", 0.004*"say didn know", 0.004*"never come back", 0.004*"make eye contact", 0.004*"would not recommend", 0.004*"could give zero", 0.004*"waste time money", 0.004*"wish could give"
3 0.005*"give zero star", 0.005*"come take order", 0.005*"never come back", 0.004*"order wait minute", 0.004*"really want like", 0.003*"new york pizza", 0.003*"waste time money", 0.003*"food good not", 0.003*"take drink order", 0.003*"didn even eat"
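The keywords in the Gensim results are two-word and three-word sequences, in line with the n_grams [2, 3] parameter. A minimal sketch of how such word n-grams can be built from already-cleaned tokens follows; the helper name `ngrams` is hypothetical and the notebook's own preprocessing is not shown.

```python
def ngrams(tokens, n):
    """Return the word n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["never", "come", "back", "waste", "time", "money"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Each review then becomes a list of n-grams instead of single words, which is what allows expressions such as "never come back" to surface as topic keywords.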
Image classification
Dataset preview
| | photo_id | business_id | caption | label |
|---|---|---|---|---|
| 190211 | ZLVGQMk0Z-OeFNWtlQFpGA | 1Nvx5xo_cErlEqpubzocSg | Southwest Chorizo Burger | food |
| 159695 | e-DRAYViNXQoFMtIRbR7ag | rElxptPIJZicDM39e1ORTg | | food |
| 177849 | hwlTSLEySvQnlvanyxXOyQ | EJ1r6E92bw7khcMSPH80rA | | inside |
| 107714 | oexZE1WbqWnOO8bEnFVIaQ | WnVNjr9zVEpK85T7dbAfEg | Still Life of Roll. Oh, and Wasabi ball. | food |
| 197858 | Vua3uNjizuCSsArRn1h6qg | p4zm3a5-Ei8wjUV_KZq23w | | inside |
(200100, 4)
label
food 108152
inside 56031
outside 18569
drink 15670
menu 1678
Name: count, dtype: int64
Creating the sample dataset
The sample contains 500 images
| | photo_id | label | width | height | mode | label_num |
|---|---|---|---|---|---|---|
| 117 | yZX66Ykboo4jdWOSCW70vA | outside | 300.0 | 400.0 | RGB | 4 |
| 461 | dTw0ZNmAetdtAeFzQqtI0w | menu | 265.0 | 400.0 | RGB | 3 |
| 325 | 4ASMLlOMvMPohnjP4yFYMw | food | 533.0 | 400.0 | RGB | 1 |
| 436 | AHZeI38pU1QhUGYDxQ69jQ | menu | 533.0 | 400.0 | RGB | 3 |
| 426 | jEa3Y6D_YHrXu8ZDzHFOVw | menu | 300.0 | 400.0 | RGB | 3 |
| | width | height | label_num |
|---|---|---|---|
| count | 500.000000 | 500.000000 | 500.00000 |
| mean | 438.882000 | 389.688000 | 2.00000 |
| std | 131.985303 | 32.814085 | 1.41563 |
| min | 131.000000 | 69.000000 | 0.00000 |
| 25% | 300.000000 | 400.000000 | 1.00000 |
| 50% | 408.000000 | 400.000000 | 2.00000 |
| 75% | 543.750000 | 400.000000 | 3.00000 |
| max | 600.000000 | 400.000000 | 4.00000 |
mode
RGB 500
Name: count, dtype: int64
Clustering with SIFT descriptors
Image preprocessing
Preprocessing example
Image and its histogram before processing
Image and its histogram after processing
Creating the descriptors
Example descriptor
Descriptors: (501, 128)
[[ 0. 0. 0. ... 1. 18. 18.]
 [13. 14. 8. ... 24. 0. 1.]
 [24. 0. 0. ... 2. 5. 25.]
 ...
 [ 0. 0. 0. ... 0. 2. 8.]
 [ 0. 0. 0. ... 0. 2. 13.]
 [ 0. 0. 10. ... 0. 0. 12.]]
| | photo_id | label | width | height | mode | label_num | desc |
|---|---|---|---|---|---|---|---|
| 360 | J4fwj6iamJ7mOCzYIQCujw | food | 533.0 | 400.0 | RGB | 1 | [[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0,... |
| 214 | SqInyQ4-CgIRXhXPJzjr0w | drink | 600.0 | 400.0 | RGB | 0 | [[42.0, 74.0, 7.0, 12.0, 22.0, 21.0, 41.0, 40.... |
| 385 | AXPr8IBkgPg4Y19uzlFakw | food | 504.0 | 337.0 | RGB | 1 | [[17.0, 2.0, 2.0, 2.0, 118.0, 58.0, 5.0, 18.0,... |
Clustering the descriptors
Principle:
- All descriptors are grouped into clusters
- The clusters are then used to classify the images by their degree of membership in each cluster
There are 494 clusters for a total of 244744 descriptors
Creating the image features
Principles:
- Each descriptor of the image is assigned to one of the descriptor clusters
- For each cluster, we count how many of the image's descriptors belong to it
- The counts can be displayed as a histogram, which is then used as the image's feature vector
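The principles above can be illustrated with a short pure-Python sketch. It assumes the cluster centres have already been computed (the clustering algorithm used by the notebook is not named), and the function names `nearest_cluster` and `image_features` are hypothetical. As a side observation, 494 is the integer part of the square root of 244744, which is consistent with a square-root rule for choosing the number of clusters, although the source does not say how the number was chosen.

```python
def nearest_cluster(descriptor, centroids):
    """Index of the centroid closest to the descriptor (squared Euclidean distance)."""
    distances = [
        sum((d - c) ** 2 for d, c in zip(descriptor, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


def image_features(descriptors, centroids):
    """Histogram counting how many of the image's descriptors fall in each cluster."""
    histogram = [0] * len(centroids)
    for descriptor in descriptors:
        histogram[nearest_cluster(descriptor, centroids)] += 1
    return histogram


# Toy data: 3 cluster centres and 5 descriptors in 2 dimensions
# (real SIFT descriptors have 128 dimensions).
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
descriptors = [[1.0, 1.0], [9.0, 1.0], [8.0, -1.0], [0.0, 9.0], [0.5, 0.2]]
features = image_features(descriptors, centroids)
print(features)  # [2, 2, 1]
```

Every image ends up with a vector of the same length (one count per cluster), whatever its number of descriptors, which is what makes the images comparable in the next step.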
| | photo_id | label | width | height | mode | label_num | desc | features |
|---|---|---|---|---|---|---|---|---|
| 294 | V7yLVqDLuVC_D0xbjQcXXw | drink | 597.0 | 400.0 | RGB | 0 | [[8.0, 1.0, 1.0, 1.0, 16.0, 33.0, 11.0, 7.0, 1... | [4, 2, 0, 2, 3, 1, 0, 0, 0, 7, 2, 0, 0, 0, 1, ... |
| 400 | huZEPkPbcgdtSKI7gclDhw | menu | 300.0 | 400.0 | RGB | 3 | [[0.0, 1.0, 112.0, 37.0, 1.0, 0.0, 0.0, 0.0, 1... | [8, 4, 1, 0, 0, 0, 1, 1, 0, 5, 4, 1, 1, 0, 2, ... |
| 219 | MSzntFiQ1aQUohWHNrkt1A | drink | 339.0 | 400.0 | RGB | 0 | [[26.0, 22.0, 8.0, 1.0, 1.0, 8.0, 41.0, 68.0, ... | [0, 0, 0, 0, 2, 4, 6, 2, 0, 1, 1, 0, 0, 0, 2, ... |
Dimensionality reduction then clustering
PCA reduction
Keeping 99.0% of the variance, PCA reduces the features from 494 components to 343 components
t-SNE reduction to 2 dimensions
Clustering
Displaying the clusters
Adjusted Rand score = 0.068
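The adjusted Rand score measures how well two groupings of the same items agree (here, presumably the clusters against the photo labels): 1 means identical groupings and values near 0 mean agreement no better than chance, so 0.068 indicates that the SIFT-based clusters barely match the labels. A minimal pure-Python sketch of the computation follows; the function name `adjusted_rand` is hypothetical, and scikit-learn provides an equivalent `sklearn.metrics.adjusted_rand_score`.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    total = comb(len(labels_true), 2)
    if total == 0:
        return 1.0  # fewer than two items: nothing to compare
    # Pairs of items grouped together in both labelings, in each one separately.
    together_both = sum(comb(n, 2) for n in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(n, 2) for n in Counter(labels_true).values())
    together_pred = sum(comb(n, 2) for n in Counter(labels_pred).values())
    expected = together_true * together_pred / total
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (together_both - expected) / (maximum - expected)


same = adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0])    # same groups, renamed
crossed = adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1])  # groups cut across each other
print(round(same, 3), round(crossed, 3))
```

The score ignores the cluster numbering itself: renaming the clusters does not change it, only the way items are grouped matters.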
Clustering with a CNN
Image preprocessing
Preprocessing example
Image before processing
Original size => Height: 400, Width: 533
Image after processing
Adjusted size => Height: 400, Width: 600
Creating features from the VGG16 CNN
Principle:
- VGG16 is used without its top part (without the dense network)
- A 1×512 vector is extracted from the CNN for each image by prediction
- This vector represents the image's features: as with SIFT, it is reduced to 2 dimensions and the images are then clustered
Model: "vgg16"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 400, 600, 3)] 0
block1_conv1 (Conv2D) (None, 400, 600, 64) 1792
block1_conv2 (Conv2D) (None, 400, 600, 64) 36928
block1_pool (MaxPooling2D) (None, 200, 300, 64) 0
block2_conv1 (Conv2D) (None, 200, 300, 128) 73856
block2_conv2 (Conv2D) (None, 200, 300, 128) 147584
block2_pool (MaxPooling2D) (None, 100, 150, 128) 0
block3_conv1 (Conv2D) (None, 100, 150, 256) 295168
block3_conv2 (Conv2D) (None, 100, 150, 256) 590080
block3_conv3 (Conv2D) (None, 100, 150, 256) 590080
block3_pool (MaxPooling2D) (None, 50, 75, 256) 0
block4_conv1 (Conv2D) (None, 50, 75, 512) 1180160
block4_conv2 (Conv2D) (None, 50, 75, 512) 2359808
block4_conv3 (Conv2D) (None, 50, 75, 512) 2359808
block4_pool (MaxPooling2D) (None, 25, 37, 512) 0
block5_conv1 (Conv2D) (None, 25, 37, 512) 2359808
block5_conv2 (Conv2D) (None, 25, 37, 512) 2359808
block5_conv3 (Conv2D) (None, 25, 37, 512) 2359808
block5_pool (MaxPooling2D) (None, 12, 18, 512) 0
global_max_pooling2d (GlobalMaxPooling2D) (None, 512) 0
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
_________________________________________________________________
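The summary above can be cross-checked with a little arithmetic, and it is consistent with a headless VGG16 built with global max pooling. The sketch below is an assumption, not the notebook's code: the helper names are hypothetical, the `weights="imagenet"` choice is a guess (the summary does not show which weights were loaded), and TensorFlow is imported lazily inside `build_extractor` so that the two checks run without it.

```python
def vgg16_feature_map(height, width):
    """Spatial size after VGG16's five 2x2 max-pooling layers."""
    for _ in range(5):
        height, width = height // 2, width // 2
    return height, width


def vgg16_conv_params():
    """Parameter count of VGG16's 13 convolution layers (3x3 kernels plus biases)."""
    blocks = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]
    total, channels_in = 0, 3
    for channels_out, n_layers in blocks:
        for _ in range(n_layers):
            total += (3 * 3 * channels_in + 1) * channels_out
            channels_in = channels_out
    return total


def build_extractor(input_shape=(400, 600, 3)):
    """Build VGG16 without its dense top; global max pooling yields 512 values per image."""
    from tensorflow.keras.applications import VGG16  # heavy dependency, imported lazily
    # Assumption: ImageNet weights; the notebook does not state which weights it used.
    return VGG16(include_top=False, weights="imagenet",
                 input_shape=input_shape, pooling="max")


print(vgg16_feature_map(400, 600))  # (12, 18), as in block5_pool
print(vgg16_conv_params())          # 14714688, as in "Total params"
```

With TensorFlow installed, `build_extractor().predict(batch)` on an array of shape `(n, 400, 600, 3)` would return an `(n, 512)` feature matrix; whether the notebook applied any input preprocessing before prediction is not shown.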
| | photo_id | label | width | height | mode | label_num | features |
|---|---|---|---|---|---|---|---|
| 19 | k6wBjugZfGi1MAIZyYTSBw | inside | 533.0 | 400.0 | RGB | 2 | [37.16306, 51.73715, 53.12264, 19.078745, 33.1... |
| 64 | LmC1GJBNbZypLlXGwSFSuQ | inside | 600.0 | 400.0 | RGB | 2 | [31.974394, 14.303509, 62.38606, 30.424088, 64... |
| 446 | EnRmcgfLoERZshTb2NJBSg | menu | 575.0 | 400.0 | RGB | 3 | [55.595566, 30.554121, 26.499107, 0.0, 28.5950... |
| 78 | ZYpOugZB43r7TF2HwqqauQ | inside | 533.0 | 400.0 | RGB | 2 | [21.98121, 33.64898, 107.18874, 20.312721, 62.... |
| 407 | wPEpDpRggdc1FCqgNHlN3w | menu | 300.0 | 400.0 | RGB | 3 | [32.333332, 31.162899, 16.601622, 0.0, 113.254... |
Test 1: PCA -> t-SNE -> KMeans
PCA reduction
Keeping 99.0% of the variance, PCA reduces the features from 512 components to 339 components
t-SNE reduction to 2 dimensions
Clustering
Displaying the clusters
Adjusted Rand score = 0.578
Test 2: KMeans -> PCA -> t-SNE
Clustering
PCA reduction
Keeping 99.0% of the variance, PCA reduces the features from 512 components to 339 components
t-SNE reduction to 2 dimensions
Displaying the clusters
Adjusted Rand score = 0.318
Test 3: PCA (50% variance) -> KMeans -> t-SNE
PCA reduction
Keeping 50.0% of the variance, PCA reduces the features from 512 components to 21 components
Clustering
t-SNE reduction to 2 dimensions
Displaying the clusters
Adjusted Rand score = 0.269
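The three tests differ only in the data given to KMeans: the t-SNE embedding (test 1), the raw features (test 2) or the PCA output (test 3). A minimal scikit-learn sketch of this comparison follows, under stated assumptions: the function name `reduce_and_cluster` and its `cluster_on` parameter are hypothetical, 5 clusters are assumed because the photos carry 5 labels, the estimator settings are illustrative, and the features are synthetic blobs rather than the CNN vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score


def reduce_and_cluster(features, n_clusters, variance=0.99, cluster_on="tsne", seed=0):
    """Return (2-D t-SNE embedding, KMeans labels) for one ordering of the steps."""
    if cluster_on not in {"raw", "pca", "tsne"}:
        raise ValueError("cluster_on must be 'raw', 'pca' or 'tsne'")
    # A float n_components keeps just enough components to reach that variance share.
    reduced = PCA(n_components=variance, svd_solver="full").fit_transform(features)
    embedding = TSNE(n_components=2, random_state=seed).fit_transform(reduced)
    inputs = {"raw": features, "pca": reduced, "tsne": embedding}
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return embedding, kmeans.fit_predict(inputs[cluster_on])


# Synthetic stand-in for the image features: 5 noisy blobs in 20 dimensions.
rng = np.random.default_rng(0)
true_labels = np.repeat(np.arange(5), 20)
centers = rng.normal(scale=10.0, size=(5, 20))
features = centers[true_labels] + rng.normal(scale=3.0, size=(100, 20))

for mode in ("tsne", "raw", "pca"):
    embedding, labels = reduce_and_cluster(features, n_clusters=5, cluster_on=mode)
    score = adjusted_rand_score(true_labels, labels)
    print(mode, round(score, 3))
```

Test 3 would correspond to `variance=0.5` with `cluster_on="pca"`. In every case the t-SNE embedding is only used for display, except in test 1 where the clustering itself runs on the 2-D points.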
Retrieving data from the Yelp API
Principles:
- Send a first request to the "search" endpoint to extract 200 restaurant ids (a loop with an offset is needed because each request returns at most 50 results)
- Send a second request, looping over the restaurant ids, to the "reviews" endpoint (the free tier returns at most 3 reviews)
- Put the data in DataFrames, then save these DataFrames as parquet files
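The two request loops described above can be sketched as follows, using only the standard library. This is an illustration under assumptions, not the notebook's code: the function names are hypothetical, the `categories` and `location` values are guesses, and the HTTP call is passed in as a `fetch` function so that the paging logic can be exercised here with a fake response instead of a real API key.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.yelp.com/v3"


def yelp_get(path, params, api_key):
    """Send one authenticated GET request to the Yelp API and return the decoded JSON."""
    query = urllib.parse.urlencode(params)
    url = f"{API_ROOT}{path}" + (f"?{query}" if query else "")
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_restaurant_ids(fetch, location, total=200, page_size=50):
    """Page through the search endpoint with an offset until total ids are collected."""
    ids = []
    for offset in range(0, total, page_size):
        page = fetch("/businesses/search", {
            "location": location, "categories": "restaurants",
            "limit": page_size, "offset": offset,
        })
        ids.extend(business["id"] for business in page["businesses"])
    return ids[:total]


def fetch_reviews(fetch, business_ids):
    """Query the reviews endpoint once per business id."""
    rows = []
    for business_id in business_ids:
        data = fetch(f"/businesses/{business_id}/reviews", {})
        rows.extend(
            {"business_id": business_id, "text": r["text"], "rating": r["rating"]}
            for r in data["reviews"]
        )
    return rows


def fake_fetch(path, params):
    """Stand-in for the API: 50 ids per search page, 3 reviews per business."""
    if path == "/businesses/search":
        start = params["offset"]
        return {"businesses": [{"id": f"id{i}"} for i in range(start, start + params["limit"])]}
    return {"reviews": [{"text": "ok", "rating": 4}] * 3}


ids = fetch_restaurant_ids(fake_fetch, "Paris")
reviews = fetch_reviews(fake_fetch, ids)
print(len(ids), len(reviews))  # 200 600
```

Against the real API, `fetch` would be something like `lambda path, params: yelp_get(path, params, api_key)`. The resulting rows can then be loaded into a DataFrame and saved as parquet, so that later runs read the files instead of spending the daily quota.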
Reading from the files of the last run (because of the API's daily limit on the free tier)
Extract of the reviews from the Yelp API
| | text | rating |
|---|---|---|
| 427 | Absolutely blown away by everything: from the ... | 5 |
| 130 | Great place for food and cocktails, highly rec... | 5 |
| 598 | Good service, food, music and ambiance. I ate ... | 5 |
| 201 | Food was excellent. Taste was just right and n... | 5 |
| 55 | Le Comptoir was recommended to me by a friend.... | 4 |
There are 600 records for 200 restaurants in the city of Paris (NB: the Yelp API only provides 3 reviews per business id on the free tier)